
fix(service): recover stalled WiFi/TCP handshakes by cycling active transport#5856

Merged
jamesarich merged 4 commits into meshtastic:main from jeremiah-k:bugfix/wifi-handshake-stall-restart
Jun 20, 2026

Conversation

@jeremiah-k jeremiah-k commented Jun 18, 2026

Contributor

Overview

This PR improves WiFi/TCP handshake-stall recovery so the app and underlying transport do not get stuck in different connection states.

Previously, if the handshake stalled after its retry window, recovery could leave the app-level connection state and underlying transport out of sync. For TCP, the transport could remain active while the UI reported a disconnected state. In that situation, selecting the same node again could be ignored because the radio service still saw the same selected address and an already-running transport.

This PR adds a targeted transport restart path for stalled handshakes. When recovery is needed, the connection manager first moves app state to Disconnected, then asks the radio service to restart the active transport. The fresh transport Connected signal can then re-enter the normal handshake flow.

Follow-up WiFi testing also showed that some failed TCP/USB handshake attempts could sit in a loading state until the lower-level TCP timeout recovered them. This PR adds a TCP/USB-only fast handshake watchdog so those stalled handshakes recover sooner, while leaving BLE's longer handshake timing unchanged.
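
As a rough illustration of the transport-specific timing described above, the watchdog timeout could be chosen per transport. This is a minimal sketch, not the PR's code: `Transport`, `handshakeTimeoutMs`, and `FAST_HANDSHAKE_TIMEOUT_MS` are hypothetical names, the 12 s value comes from the Key Changes list, and the BLE timeout is passed in because its value is not stated in this thread.

```kotlin
// Hypothetical transport kinds; the real app models transports differently.
enum class Transport { BLE, TCP, USB }

// 12 s fast watchdog for TCP/USB, as stated in the Key Changes list.
const val FAST_HANDSHAKE_TIMEOUT_MS = 12_000L

// BLE keeps its existing (longer) timeout, supplied by the caller because
// this thread does not state its value.
fun handshakeTimeoutMs(transport: Transport, bleTimeoutMs: Long): Long =
    when (transport) {
        Transport.TCP, Transport.USB -> FAST_HANDSHAKE_TIMEOUT_MS
        Transport.BLE -> bleTimeoutMs
    }
```

Keeping the BLE value as a parameter reflects the stated intent that BLE timing is left unchanged by the fast path.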

Additional BLE testing exposed a separate post-handshake stuck-connecting path. Once firmware sends NODE_INFO_NONCE, the handshake watchdog is intentionally cancelled before local NodeDB install work begins. If that local NodeDB install fails, the app can otherwise remain in Connecting with no active watchdog. This PR now recovers from NodeDB install failure by routing through the same reconnecting transport-restart path.

This PR also adds defensive recovery hardening so a crashing or repeatedly failing node is not re-driven indefinitely. Recovery now uses clean transport restart only, applies exponential backoff between repeated attempts, and surfaces a sticky error after repeated unsuccessful recovery cycles.

Addresses: #3727

Key Changes

Handshake Recovery

  • Adds RadioInterfaceService.restartTransport() for silent recovery of stalled mesh handshakes.
  • Updates MeshConnectionManagerImpl so retry-exceeded handshake stalls request a transport restart instead of only changing app-level state.
  • Adds MeshConnectionManager.recoverPostHandshakeFailure() so NodeDB install failures after firmware handshake completion use the same app-level disconnect-before-restart ordering.
  • Adds a 12s TCP/USB-only fast handshake watchdog for Stage 1 and Stage 2.
  • Resets the fast watchdog only on meaningful handshake progress.
  • Cancels the Stage 2 watchdog synchronously when NODE_INFO_NONCE arrives, before async NodeDB install work begins.
  • Adds a one-way handshake-complete latch so late progress packets cannot re-arm the watchdog after firmware handshake completion.
  • Recovers only from pre-Connected NodeDB install failure; post-Connected analytics / side-effect failures are logged without forcing recovery.
  • Orders recovery so the app-level Disconnected transition happens before transport restart emissions.
  • Runs recovery work in a sibling coroutine so the restart sequence is not lost when the existing handshake timeout job is cancelled.
  • Re-checks the current app state before recovery so a late-arriving successful handshake is not restarted unnecessarily.
  • Uses atomic watchdog swaps so concurrent re-arms cannot orphan timeout jobs.
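
The atomic swap and the one-way latch from the list above can be sketched roughly as follows. This is an illustrative model under assumptions, not the actual implementation: `TimeoutJob` stands in for a cancellable coroutine job, and `HandshakeWatchdog`, `rearm`, and `complete` are hypothetical names.

```kotlin
import java.util.concurrent.atomic.AtomicBoolean
import java.util.concurrent.atomic.AtomicReference

// Hypothetical stand-in for a cancellable timeout job (e.g. a coroutine Job).
class TimeoutJob(val label: String) {
    @Volatile var cancelled = false
        private set

    fun cancel() { cancelled = true }
}

class HandshakeWatchdog {
    private val current = AtomicReference<TimeoutJob?>(null)
    private val handshakeComplete = AtomicBoolean(false) // one-way latch

    /** Re-arms the watchdog; returns false once the latch has closed. */
    fun rearm(next: TimeoutJob): Boolean {
        if (handshakeComplete.get()) return false
        // getAndSet returns the previous job atomically, so concurrent
        // re-arms always cancel the job they replace (no orphaned timeouts).
        current.getAndSet(next)?.cancel()
        // Completion may have raced with the swap; undo the re-arm if so.
        if (handshakeComplete.get()) {
            current.getAndSet(null)?.cancel()
            return false
        }
        return true
    }

    /** Closes the latch and cancels whatever job is currently armed. */
    fun complete() {
        handshakeComplete.set(true)
        current.getAndSet(null)?.cancel()
    }
}
```

The point of the swap is that a separate cancel followed by an assignment leaves a window in which a second re-arm can overwrite a job that is never cancelled; a single `getAndSet` closes that window.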

Recovery Hardening

  • Removes the same-session want_config retry path.
  • Uses clean transport restart for stalled recovery on both fast transports and BLE retry-exhausted paths.
  • Adds exponential backoff between consecutive recovery attempts.
  • Adds a consecutive recovery cap.
  • Surfaces a localized sticky error after repeated unsuccessful recovery attempts instead of silently looping forever.
  • Resets the recovery failure counter after successful handshake completion.

Transport Restart Behavior

  • Implements SharedRadioInterfaceService.restartTransport() as an in-place stop/start of the currently active transport.
  • Preserves the selected device address and existing connection intent.
  • Emits a transient DeviceSleep transport transition before restarting so observers can see a real state change before the fresh Connected event.
  • Keeps recovery quiet from the user's perspective by avoiding a permanent transport Disconnected event and avoiding a user-facing error.
  • Skips restart when:
    • the user explicitly disconnected,
    • the selected device was cleared,
    • no valid selected address exists,
    • no transport is currently running,
    • another restart or liveness recovery is already in progress.
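
The skip conditions above amount to a precondition check followed by a single-entry guard. The sketch below assumes a simplified state snapshot; `TransportState`, `RestartGate`, and `tryRestart` are hypothetical names, and only the idea of an `isRestarting` compare-and-set guard is taken from the commit message later in this thread.

```kotlin
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical snapshot of the service state consulted before a restart.
data class TransportState(
    val userDisconnected: Boolean,
    val selectedAddress: String?, // null or blank models a cleared/invalid selection
    val transportRunning: Boolean,
)

class RestartGate {
    private val isRestarting = AtomicBoolean(false)

    /** Runs [restart] with the preserved address; returns false if skipped. */
    fun tryRestart(state: TransportState, restart: (String) -> Unit): Boolean {
        if (state.userDisconnected) return false
        val address = state.selectedAddress?.takeIf { it.isNotBlank() } ?: return false
        if (!state.transportRunning) return false
        // Only one caller wins the CAS; concurrent or nested requests are skipped.
        if (!isRestarting.compareAndSet(false, true)) return false
        try {
            restart(address)
        } finally {
            isRestarting.set(false)
        }
        return true
    }
}
```

Because the address is read from the snapshot and handed straight back to the restart callback, the selected device and connection intent are preserved across the stop/start.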

Handshake Progress

  • Tracks meaningful Stage 1 progress from MyNodeInfo, local metadata, file info, device config, module config, channels, DeviceUI config, and early NodeInfo.
  • Tracks Stage 2 progress from NodeInfo packets.
  • Buffers early NodeInfo received during Stage 1 so it is not lost before the node-list phase starts.
  • Keeps Stage 2 completion gated on NODE_INFO_NONCE.
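
A simplified model of the buffering and gating described above might look like the following. All names here (`NodeInfo`, `ConfigFlow`, `pendingNodes`, and the handler methods) are hypothetical; the real packet types and handlers are more involved.

```kotlin
// Hypothetical, heavily simplified packet type.
data class NodeInfo(val num: Int)

class ConfigFlow(private val onHandshakeProgress: () -> Unit) {
    private var stage = 1
    private val earlyNodeInfo = mutableListOf<NodeInfo>()

    /** Nodes collected for the node-list phase, in arrival order. */
    val pendingNodes = mutableListOf<NodeInfo>()

    var complete = false
        private set

    fun onNodeInfo(info: NodeInfo) {
        onHandshakeProgress() // NodeInfo counts as progress in either stage
        if (stage == 1) earlyNodeInfo += info else pendingNodes += info
    }

    fun startStage2() {
        stage = 2
        // Flush anything that arrived before the node-list phase started.
        pendingNodes += earlyNodeInfo
        earlyNodeInfo.clear()
    }

    /** Stage 2 only completes when the nonce arrives. */
    fun onNodeInfoNonce() {
        if (stage == 2) complete = true
    }
}
```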

Connection UI

  • Adds a transient Reconnecting… state while silent handshake recovery is in progress.
  • Keeps transient TCP/USB handshake recovery from looking like a final user-facing disconnect.

Test Coverage

  • Adds MeshConnectionManagerImplTest coverage for:

    • Stage 1 config handshake stall recovery,
    • Stage 2 node-info handshake stall recovery,
    • TCP/USB fast recovery timing,
    • meaningful progress resetting the fast watchdog,
    • no fast recovery for BLE,
    • no restart when the handshake completes normally,
    • recovery ordering so app disconnect is processed before restart emissions,
    • post-handshake failure recovery using the same disconnect-before-restart ordering,
    • exponential recovery backoff,
    • consecutive recovery exhaustion surfacing a sticky error,
    • successful handshake completion resetting the recovery counter.
  • Adds MeshConfigFlowManagerImplTest / MeshConfigHandlerImplTest coverage for:

    • meaningful handshake progress signals,
    • early NodeInfo buffering,
    • Stage 2 completion behavior,
    • watchdog cancellation when NODE_INFO_NONCE arrives,
    • NodeDB install failure triggering post-handshake recovery,
    • post-NodeDB analytics / side-effect failures not forcing recovery after Connected.
  • Adds SharedRadioInterfaceServiceLivenessTest coverage for:

    • active transport restart creates a fresh transport,
    • restart is a no-op after explicit disconnect,
    • restart is a no-op after device deselection,
    • selected address is preserved,
    • restart does not emit permanent Disconnected,
    • restart does not emit a user-facing connection error,
    • restart coordinates with in-flight liveness recovery,
    • restart does not bypass environmental recovery when the transport was already stopped.
  • Adds Connections UI/ViewModel coverage for:

    • transient Reconnecting… mapping.

Follow-up

This PR addresses the Android-side handshake-stall recovery path associated with the reported WiFi/TCP reconnect behavior.

During testing, this also uncovered a firmware-side crash triggered by repeated or re-entered config/node-list startup on firmware build tbeam / v2.7.25.104df5f. The firmware-side fix is being tracked separately in meshtastic/firmware#10754. This PR adds app-side defensive depth so Android does not keep re-driving a node that repeatedly fails recovery.

If field logs still show periodic TCP reconnect or configuration loops after this lands, a follow-up investigation should look at TCP idle-timeout behavior and add privacy-safe disconnect-reason logging for EOF, IOException, write failure, and idle timeout.


I mostly run nRF52 devices over BLE, so I do not have a strong baseline for long-term WiFi behavior. I tested these changes with WiFi-connected ESP32 devices while reconnecting, scanning, and disabling/re-enabling WiFi. I also tested the post-handshake recovery and recovery-hardening changes on my BLE device after the T1000-E stuck-connecting behavior surfaced during review.

@github-actions github-actions Bot added the bugfix PR tag label Jun 18, 2026
@jeremiah-k jeremiah-k marked this pull request as draft June 18, 2026 20:11
@jamesarich jamesarich added skip-preview-check Dismiss the advisory preview-staleness check (no UI preview update needed) skip-docs-check Dismiss the advisory docs-staleness check (no user/developer doc update needed) labels Jun 18, 2026
@jeremiah-k jeremiah-k changed the title from "fix(service): Fix WiFi handshake stall recovery by cycling active transport" to "fix(service): recover stalled WiFi/TCP handshakes by cycling active transport" Jun 19, 2026
@jeremiah-k

Contributor Author

@jamesarich One thing worth noting from my WiFi testing: this app-side fix may expose a firmware crash on some affected builds.

On a tbeam running firmware 2.7.25.104df5f, the app-side fix made the connection flow behave correctly and retry/reconnect cleanly. Once that happened, the node started rebooting consistently during the normal config/node-list load. Before this fix, the app could sometimes look connected or get stuck in a partial state, so it may not have hit the crashing firmware path as reliably.

It looks like the app is now reaching the proper handshake path consistently and that path exposes a firmware-side crash. I have a patch in progress around PhoneAPI::handleStartConfig() to avoid the abort and I'll open a PR to the firmware repo for that soon.

As far as this PR goes, the debug build is behaving well with the patched firmware, so I think it's ready for review.

@jeremiah-k jeremiah-k marked this pull request as ready for review June 19, 2026 16:33
@jeremiah-k jeremiah-k marked this pull request as draft June 19, 2026 20:20
@jeremiah-k jeremiah-k marked this pull request as ready for review June 19, 2026 22:16

@jamesarich jamesarich left a comment

Collaborator

Thanks for the thorough work here, and especially for the field testing + the firmware-crash writeup — that investigation is genuinely valuable. The core fix is correct: the WiFi/TCP split-brain (transport Connected while the app is Disconnected, blocking same-node reconnect) is the right diagnosis, and cycling the transport in place is the right shape. Test coverage is strong.

Two things before this is ready to merge:

1. Scope — this PR has outgrown its title. It's now +2328/−120 across 26 files / 20 commits, several unrelated to handshake recovery. Please split into their own PR(s):

  • fix(data): make device link catalog refresh transactional
  • fix(data): separate refresh timeouts from Room persistence (DeviceHardwareRepository / DeviceLink / StaleWhileRevalidate)
  • the isWifiUnavailable / VPN-banner work (3 commits) — a distinct bug from handshake recovery

Also: squash the "restart transport after post-handshake failure" / "narrow post-handshake recovery boundary" pair, and a rebase would drop the 6 Merge branch 'main' commits. A core-connectivity change is far easier to review for regressions when it's just the recovery fix.

2. Recovery hardening (ties to your firmware finding). I dug into why the clean handshake trips the T-Beam reboot: firmware handleStartConfig() (build v2.7.25.104df5f) re-enters on every want_config with no in-flight guard and never resets its config_state sub-iterator. Android's stall-guard re-sends want_config on a timer and re-handshakes with no backoff/cap — that re-send is what re-enters the fragile firmware path mid node-list load. (The firmware write-dedup is single-slot memcmp vs the previous write only, so the interleaved heartbeats let the re-sent nonce slip past it — the "firmware will drop our retry" assumption doesn't hold.) Your firmware patch is the proper fix; a small app-side backoff + give-up cap is good defensive depth so a crashing node isn't re-driven every cycle. Details inline.
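
To make the dedup point concrete, here is a toy Kotlin model of a single-slot "compare with the previous write only" filter. It is an illustration of the behavior described above, not the firmware code (which is C++); `SingleSlotDedup` and `accept` are hypothetical names, and the byte values are placeholders.

```kotlin
// Toy model: remembers only the most recent write and drops an exact repeat.
class SingleSlotDedup {
    private var last: ByteArray? = null

    /** Returns true when the write is accepted, false when dropped as a duplicate. */
    fun accept(write: ByteArray): Boolean {
        val duplicate = last?.contentEquals(write) == true
        last = write.copyOf()
        return !duplicate
    }
}

fun main() {
    val wantConfig = byteArrayOf(0x18, 0x2A) // placeholder bytes, not a real frame
    val heartbeat = byteArrayOf(0x3A, 0x00)  // placeholder bytes, not a real frame

    val backToBack = SingleSlotDedup()
    println(backToBack.accept(wantConfig)) // first write is accepted
    println(backToBack.accept(wantConfig)) // immediate repeat is dropped

    val interleaved = SingleSlotDedup()
    println(interleaved.accept(wantConfig)) // accepted
    println(interleaved.accept(heartbeat))  // accepted, and overwrites the slot
    println(interleaved.accept(wantConfig)) // accepted again: the retry slips through
}
```

With a heartbeat between the two identical requests, the slot no longer holds the first request, so the retry is treated as new.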

Everything else is minor (inline). Nice work overall — this is close.

Comment thread core/ui/src/commonMain/kotlin/org/meshtastic/core/ui/util/NetworkTransportInfo.kt Outdated
@jeremiah-k

Contributor Author

@jamesarich Thanks - I split the independent scope into two separate PR branches and kept this PR focused on the core handshake recovery work.

New split branches:

  1. bugfix/data-refresh-timeouts

    • separates remote API timeouts from local Room persistence
    • touches DeviceHardwareRepository, DeviceLinkRepository, StaleWhileRevalidateFlow, and the related repository test
    • one clean commit, 4 files
  2. bugfix/vpn-network-scan-transport

    • checks all current networks for scan availability
    • treats WiFi, Ethernet, or VPN as valid network-scan transports
    • keeps cellular-only unavailable
    • removes the unused hasCellular field
    • one clean commit, 5 files

I also addressed the recovery-hardening feedback locally on the core branch:

  • added exponential recovery backoff
  • added a consecutive recovery cap with a localized sticky error
  • removed the same-session want_config retry path
  • added a one-way handshake-complete latch so late packets cannot re-arm the watchdog after NODE_INFO_NONCE
  • converted handshake watchdog re-arms to atomic swaps
  • fixed the CI/test fallout by correcting the Kermit severity enum, satisfying detekt, isolating test fixtures, and stabilizing coroutine timing assertions

Once the two smaller PRs land, I’ll rebase this PR onto current main, drop the merge commits and dead transactional device-link commit, fold the CI/test-fix commits into their parent changes, and force-push a tightened core-connectivity PR.

When a handshake stalls and recovery is needed, the connection manager must
be able to cycle the active transport so a fresh Connected signal can
re-enter the handshake flow. Previously there was no way to restart the
transport without a full disconnect/reconnect cycle, leaving the transport
Connected while the app reported Disconnected — a TCP split-brain that
blocked same-node reconnect because the radio service still saw the same
selected address and an already-running transport.

restartTransport is an in-place stop/start of the currently active transport:
- Preserves the selected device address and connection intent
- Emits a transient DeviceSleep transition so observers see a real state change
- Keeps recovery quiet from the user (no permanent Disconnected, no error)
- Skips restart on explicit disconnect, device deselection, missing address,
  no running transport, or when another restart is already in progress
- Coordinates with in-flight liveness recovery via an isRestarting CAS guard

When a handshake stall is detected, request a transport restart via
runSiblingHandshakeRecovery instead of only changing app-level state.
The sibling coroutine is parented to scope (not handshakeTimeout) so
the restart sequence survives cancellation of the expired watchdog job.

Recovery hardening:
- Exponential backoff (2s base, doubling, 30s cap) between recovery attempts
- After 3 consecutive failures, surface a localized sticky error and stop
  retrying so a crashing node is not re-driven indefinitely. Counter resets
  on successful handshake or sticky-error surface.
- Removed mid-session want_config re-send that re-entered firmware's
  fragile handleStartConfig() path (root cause of T-Beam reboots under
  retry). Both BLE and fast transports now recover via clean transport restart.
- One-way handshakeCompleteLatch prevents late-arriving FileInfo/config
  packets from re-arming the watchdog after NODE_INFO_NONCE
- All handshakeTimeout re-arm sites use atomic getAndSet swap to prevent
  orphaned jobs between cancel and reassign
- SafeCatchingAll wraps the localized error resolution so headless JVM
  tests without Skiko do not silently swallow setErrorMessage

Concurrent recovery is guarded by three layers: connectionMutex serializes
the Disconnected transition, the fromState=Connecting check rejects
duplicate siblings, and the transport-level isRestarting CAS is authoritative.
…ures

Emit onHandshakeProgress from meaningful Stage 1 packet handlers (MyNodeInfo,
local metadata, file info, device config, module config, channels, DeviceUI
config, early NodeInfo) and Stage 2 handlers (NodeInfo packets) so the
connection manager can reset the fast watchdog on real progress.

Buffer early NodeInfo received during Stage 1 so it is not lost before the
node-list phase begins. Stage 2 completion remains gated on NODE_INFO_NONCE.
The watchdog is cancelled synchronously when NODE_INFO_NONCE arrives, before
async NodeDB install work begins.

If NodeDB install fails after the firmware handshake completed, recover via
recoverPostHandshakeFailure which routes through the same disconnect-before-
restart path as a stalled handshake. Only pre-Connected failures trigger
recovery; post-Connected analytics/side-effect failures are logged without
forcing a restart.

Add a transient RECONNECTING display state while silent handshake recovery
is in progress so the user sees activity rather than a frozen Connecting
spinner. The state maps from Connecting + RECONNECTING_PROGRESS_TEXT and
does not produce a user-facing disconnect or error.

Add a localized Res.string.error_recovery_exhausted for the sticky error
surfaced when recovery gives up after repeated failures. Resolved via
getStringSuspend in the connection manager's recovery coroutine.

Extract RECONNECTING_PROGRESS_TEXT as a shared constant in
ServiceRepository so both the connection manager (writer) and the
ConnectionsViewModel (reader) reference the same contract.
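
The writer/reader contract around RECONNECTING_PROGRESS_TEXT could be modeled roughly like this. Only the constant's name and the "Connecting + RECONNECTING_PROGRESS_TEXT maps to RECONNECTING" rule come from the commit message; the constant's value, the enums, and `displayStateFor` are assumptions for illustration.

```kotlin
// Placeholder value; the real constant lives in ServiceRepository.
const val RECONNECTING_PROGRESS_TEXT = "reconnecting"

// Hypothetical, simplified state types.
enum class ConnectionState { Disconnected, Connecting, Connected }
enum class DisplayState { DISCONNECTED, CONNECTING, RECONNECTING, CONNECTED }

// Reader side: derive the display state from app state plus progress text.
fun displayStateFor(state: ConnectionState, progressText: String?): DisplayState =
    when (state) {
        ConnectionState.Connected -> DisplayState.CONNECTED
        ConnectionState.Disconnected -> DisplayState.DISCONNECTED
        ConnectionState.Connecting ->
            if (progressText == RECONNECTING_PROGRESS_TEXT) {
                DisplayState.RECONNECTING
            } else {
                DisplayState.CONNECTING
            }
    }
```

Sharing one constant between the writer and the reader avoids the two sides drifting apart on a string literal.
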
@jeremiah-k jeremiah-k force-pushed the bugfix/wifi-handshake-stall-restart branch from 6c49ec5 to 221fff8 Compare June 20, 2026 18:36
@jeremiah-k

jeremiah-k commented Jun 20, 2026

Contributor Author

@jamesarich Resolved.

The scope-related changes were split out and landed separately (#5881 and #5882), the merge/dead commits were removed during the rebase, and the remaining branch is now focused on the connection recovery work.

The recovery hardening feedback has also been incorporated: recovery now uses transport restart with exponential backoff and a give-up cap, the same-session want_config retry path was removed, and the watchdog/latch changes prevent late progress from re-arming recovery after handshake completion.

I also addressed the remaining inline review comments during the cleanup and rebase. Please let me know if I missed anything.

Testing is continuing to go well for the firmware-side fix. I'm preparing the associated firmware PR and will link it once it's up.

@jeremiah-k jeremiah-k requested a review from jamesarich June 20, 2026 19:00
@jeremiah-k jeremiah-k marked this pull request as ready for review June 20, 2026 19:02
@jamesarich jamesarich added this pull request to the merge queue Jun 20, 2026
Merged via the queue into meshtastic:main with commit 3ca87fa Jun 20, 2026
23 checks passed
@jeremiah-k jeremiah-k deleted the bugfix/wifi-handshake-stall-restart branch June 20, 2026 19:18
@jeremiah-k

Contributor Author

Quick follow-up: I opened the firmware-side PR for the crash that came up while testing this:

meshtastic/firmware#10754

That PR handles the tbeam / v2.7.25.104df5f crash/reboot I was seeing once Android reliably reached the Stage 2 node-info/config path. This Android PR keeps the app-side recovery/backoff behavior, while the firmware PR addresses the underlying firmware crash.


Labels

bugfix PR tag skip-docs-check Dismiss the advisory docs-staleness check (no user/developer doc update needed) skip-preview-check Dismiss the advisory preview-staleness check (no UI preview update needed)

Projects

None yet


2 participants